Spam filters: Bayes vs. chi-squared; letters vs. words

Authors

  • Cormac O'Brien
  • Carl Vogel

Abstract

We compare two statistical methods for identifying spam, or junk electronic mail. Spam filters are classifiers that determine whether an email is junk. The proliferation of spam has made electronic filtering vitally important, and we discuss the magnitude of the problem. We examine the Naive Bayesian method in relation to the ‘Chi by degrees of Freedom’ approach, which is used in the field of authorship identification. Both methods produce very promising results. However, ‘Chi by degrees of Freedom’ has the advantage of providing significance measures, which will help to reduce false positives. Statistics based on character-level tokenization prove more effective than those based on word-level tokenization.
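
The abstract does not give implementation details, so the following is only a rough sketch of the two ideas it contrasts: word-level versus character-level tokenization, and a Naive Bayes score versus a chi-squared statistic divided by its degrees of freedom. All function names, the bigram size, the add-one smoothing, the equal class priors, and the toy messages are assumptions made for illustration, not the authors' procedure.

```python
import math
from collections import Counter

def word_tokens(text):
    """Word-level tokens: lowercase, split on whitespace."""
    return text.lower().split()

def char_ngrams(text, n=2):
    """Character-level tokens: overlapping n-grams (n=2 is an assumption)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def train(docs, tokenize):
    """Sum token counts over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts

def nb_log_odds(text, spam_counts, ham_counts, tokenize):
    """Naive Bayes log-odds of spam vs. ham.

    Assumes equal class priors and add-one smoothing; a positive
    score means the text looks more like the spam sample.
    """
    vocab_size = len(set(spam_counts) | set(ham_counts)) + 1  # +1 for unseen tokens
    n_spam = sum(spam_counts.values())
    n_ham = sum(ham_counts.values())
    score = 0.0
    for tok in tokenize(text):
        p_spam = (spam_counts[tok] + 1) / (n_spam + vocab_size)
        p_ham = (ham_counts[tok] + 1) / (n_ham + vocab_size)
        score += math.log(p_spam / p_ham)
    return score

def chi_by_df(counts_a, counts_b):
    """Chi-squared over token counts of two samples, divided by degrees of freedom.

    Expected counts assume both samples draw from one distribution, so a
    lower value means the samples are more alike. Both samples must be
    non-empty. This is a simplified reading of the statistic.
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    tokens = set(counts_a) | set(counts_b)
    chi2 = 0.0
    for tok in tokens:
        total = counts_a[tok] + counts_b[tok]
        exp_a = total * n_a / (n_a + n_b)
        exp_b = total * n_b / (n_a + n_b)
        chi2 += (counts_a[tok] - exp_a) ** 2 / exp_a
        chi2 += (counts_b[tok] - exp_b) ** 2 / exp_b
    return chi2 / max(len(tokens) - 1, 1)

# Toy data, invented for this sketch only.
spam_docs = ["buy cheap pills now", "cheap pills cheap offer"]
ham_docs = ["meeting notes for monday", "lunch on monday"]
message = "cheap pills"

for tokenize in (word_tokens, char_ngrams):
    spam_counts = train(spam_docs, tokenize)
    ham_counts = train(ham_docs, tokenize)
    msg_counts = Counter(tokenize(message))
    print(tokenize.__name__,
          nb_log_odds(message, spam_counts, ham_counts, tokenize),
          chi_by_df(msg_counts, spam_counts),
          chi_by_df(msg_counts, ham_counts))
```

Under this sketch a message would be assigned to whichever class gives the lower chi-squared per degree of freedom, or to spam when the Naive Bayes log-odds is positive. Swapping `word_tokens` for `char_ngrams` is the only change needed to move between the word-level and character-level settings.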

Similar resources

Searching for Interacting Features for Spam Filtering

In this paper, we propose a novel feature selection method, INTERACT, to select relevant words of emails for spam email filtering, i.e. classifying an email as spam or legitimate. Four traditional feature selection methods in the text categorization domain, Information Gain, Gain Ratio, Chi Squared, and ReliefF, are also used for performance comparison. Three classifiers, Support Vector Machine (SVM...

Good Word Attacks on Statistical Spam Filters

Unsolicited commercial email is a significant problem for users and providers of email services. While statistical spam filters have proven useful, senders of spam are learning to bypass these filters by systematically modifying their email messages. In a good word attack, one of the most common techniques, a spammer modifies a spam message by inserting or appending words indicative of legitima...

Exploration of Neuro-Fuzzy Spam Filtering based on Naive Bayes Filters

A text parser was used to calculate the statistical distribution of words within an email body. This information was used by a neuro-fuzzy system to determine the spam classification of the email. This process of detecting spam in an email was experimentally found to be 90% efficient. This design is exceptionally good as compared to present-day filters based on its simplicity and limited scope o...

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

This paper explores the use of the naive Bayes classifier as the basis for personalised spam filters. Several machine learning algorithms, including variants of naive Bayes, have previously been used for this purpose, but the author’s implementation using word-position-based attribute vectors gave very good results when tested on several publicly available corpora. The effects of various forms o...

Naive Bayes Spam Filtering Using Word Position Attributes

This paper explores the use of the naive Bayes classifier as the basis for personalized spam filters. Various machine learning algorithms, including variants of naive Bayes, have previously been used for this purpose, but the author’s implementation using word position based attribute vectors gives very good results when tested on several publicly available corpora. The effect of various forms ...

Publication date: 2003